About This Notebook

This notebook demonstrates how to use ML Workbench to create a regression model that accepts numeric and categorical data. It shows "local run" mode, which does most of the work (except for exporting data from BigQuery) on Datalab's VM, and it uses only 0.3% of the data (about 200K instances). The next notebook demonstrates how to deal with large data (~70M instances) by running every step in Google Cloud.

Execution of this notebook requires Google Datalab (see setup instructions).

The Data

We will use Chicago Taxi Trip Data. From the pickup location, drop-off location, trip time, and taxi company, the model we will build predicts the trip fare.

Sample Data from BigQuery and Export It to a DataFrame

We will use the following query to get our training data.

Note that we convert weekday, day, and hour to STRING so that they are treated as categorical rather than numeric features.


In [34]:
%%bq query --name taxi_query
SELECT
  unique_key,
  fare,
  CAST(EXTRACT(DAYOFWEEK FROM trip_start_timestamp) AS STRING) as weekday,
  CAST(EXTRACT(DAYOFYEAR FROM trip_start_timestamp) AS STRING) as day,
  CAST(EXTRACT(HOUR FROM trip_start_timestamp) AS STRING) as hour,
  pickup_latitude,
  pickup_longitude,
  dropoff_latitude,
  dropoff_longitude,
  company
FROM `bigquery-public-data.chicago_taxi_trips.taxi_trips`
WHERE 
  fare > 2.0 AND fare < 200.0 AND
  pickup_latitude IS NOT NULL AND
  pickup_longitude IS NOT NULL AND
  dropoff_latitude IS NOT NULL AND
  dropoff_longitude IS NOT NULL AND
  taxi_id IS NOT NULL



In [35]:
# Sample 0.3% of the data and split it into train/eval sets.

import google.datalab.bigquery as bq
import numpy as np

sampling = bq.Sampling.random(percent=0.3)
job = taxi_query.execute(sampling=sampling)
df = job.result().to_dataframe()
msk = np.random.rand(len(df)) < 0.95
train_df = df[msk]
eval_df = df[~msk]
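
If you want the train/eval split to be reproducible across runs, you can seed NumPy before computing the mask. A minimal sketch (the seed value is arbitrary):

import numpy as np

# Seeding the RNG makes the 95/5 split deterministic across notebook runs.
np.random.seed(42)
msk = np.random.rand(len(df)) < 0.95
train_df = df[msk]
eval_df = df[~msk]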



In [36]:
print('Training set includes %d instances.' % len(train_df))
print('Eval set includes %d instances.' % len(eval_df))


Training set includes 203998 instances.
Eval set includes 10832 instances.

Save Data

Save the data for model training.


In [37]:
!mkdir -p ./taxi



In [38]:
train_df.to_csv('./taxi/train.csv', header=False, index=False)
eval_df.to_csv('./taxi/eval.csv', header=False, index=False)


Explore Data

Before we use the data, we need to explore it. In reality, data exploration and feature engineering form an iterative process. For example, the query above was shaped by earlier data exploration (fare < 200, pickup_latitude IS NOT NULL, etc.).
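
For a quick first look before the richer views below, plain pandas on the sampled DataFrame works too. A minimal sketch using the train_df created above:

# Numeric summaries (count/mean/std/quantiles) for the numeric columns.
print(train_df.describe())

# Most frequent values of a categorical column.
print(train_df['company'].value_counts().head(10))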

The following %%ml command defines the dataset and also explores the data with one overview and one facets view. Note that these views don't show up if this notebook is viewed on GitHub, because they require frontend files that are served from Google Cloud Datalab.


In [39]:
# This loads %%ml commands
import google.datalab.contrib.mlworkbench.commands



In [40]:
%%ml dataset create
format: csv
train: ./taxi/train.csv
eval: ./taxi/eval.csv
name: taxi_data
schema:
    - name: unique_key
      type: STRING
    - name: fare
      type: FLOAT
    - name: weekday
      type: STRING
    - name: day
      type: STRING
    - name: hour
      type: STRING
    - name: pickup_latitude
      type: FLOAT
    - name: pickup_longitude
      type: FLOAT
    - name: dropoff_latitude
      type: FLOAT
    - name: dropoff_longitude
      type: FLOAT
    - name: company
      type: STRING



In [41]:
%%ml dataset explore --overview
name: taxi_data


train data instances: 203998
eval data instances: 10832
Sampled 1000 instances for each.

In [42]:
%%ml dataset explore --facets
name: taxi_data


train data instances: 203998
eval data instances: 10832
Sampled 1000 instances for each.

Create Model with ML Workbench

The MLWorkbench Magics are a set of Datalab commands that provide an easy, code-free experience for training, deploying, and predicting with ML models. This notebook takes the sampled data and builds a regression model. There is a magic command for each step of the ML workflow: analyzing input data to build transforms, transforming data, training a model, evaluating a model, and deploying a model.

For details of each command, run with --help. For example, "%%ml train --help".

When the dataset is small, there is little benefit to using cloud services. This notebook runs the analyze, transform, and training steps locally. However, we will take the locally trained model, deploy it to ML Engine, and show how to make real predictions against the deployed model. Every MLWorkbench magic can run locally or use cloud services (by adding the --cloud flag).

The next notebook in this sequence shows the cloud version of every command, using the full data.

Step 1: Analyze

The first step in the MLWorkbench workflow is to analyze the data for the requested transformations. In this case, analysis builds vocabularies for the categorical features and computes numeric statistics for the numeric features.


In [43]:
!rm -r -f ./taxi/analysis # Delete previous run results.



In [44]:
%%ml analyze
output: ./taxi/analysis
data: $taxi_data
features:
  unique_key:
    transform: key
  fare:
    transform: target   
  weekday:
    transform: one_hot
  day:
    transform: one_hot
  hour:
    transform: one_hot
  pickup_latitude:
    transform: scale    
  pickup_longitude:
    transform: scale
  dropoff_latitude:
    transform: scale
  dropoff_longitude:
    transform: scale
  company:
    transform: embedding
    embedding_dim: 10


Expanding any file patterns...
file list computed.
Analyzing file /content/datalab/docs/samples/contrib/mlworkbench/structured_data_regression_taxi/taxi/train.csv...
file /content/datalab/docs/samples/contrib/mlworkbench/structured_data_regression_taxi/taxi/train.csv analyzed.

Note that in the above "features" config, "target" is required and has to be specified explicitly. "key" means the column is not used as a model feature and is just passed through. For the other columns, a default transform is chosen if none is specified.
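
To see what the analyze step produced, you can list its output directory. You should find numeric stats plus vocabulary files for the categorical columns, though the exact file names may vary by ML Workbench version:

!ls ./taxi/analysis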

Step 2: Transform

This step is optional, as training can start directly from CSV data (the same data used in the analysis step). The transform step applies the requested transformations to the input data and saves the results to a special TensorFlow file called a TFRecord file, which contains TF.Example protocol buffers. This allows training to start from preprocessed data. Without this step, training would have to perform the same preprocessing on every row of CSV data each time it is read, and since TensorFlow reads each row multiple times during training, the same row would be preprocessed multiple times. Writing the preprocessed data to disk therefore speeds up training.

We run the transform step for the training and eval data.


In [45]:
!rm -r -f ./taxi/transform # Delete previous run results.



In [46]:
%%ml transform
output: ./taxi/transform
analysis: ./taxi/analysis
data: $taxi_data


WARNING:root:Couldn't find python-snappy so the implementation of _TFRecordUtil._masked_crc32c is not as fast as it could be.
WARNING:root:Couldn't find python-snappy so the implementation of _TFRecordUtil._masked_crc32c is not as fast as it could be.
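
As a sanity check, you can read one transformed record back and decode it as a tf.train.Example proto. This is a minimal sketch using the TensorFlow 1.x API (tf.python_io); it allows for the possibility that the transform output is gzip-compressed, which a .gz suffix would indicate:

import tensorflow as tf

# Grab the first transformed training file and decode its first record.
files = tf.gfile.Glob('./taxi/transform/train-*')
options = None
if files[0].endswith('.gz'):
    # Handle gzip-compressed TFRecords if that is what transform wrote.
    options = tf.python_io.TFRecordOptions(
        tf.python_io.TFRecordCompressionType.GZIP)
record = next(tf.python_io.tf_record_iterator(files[0], options=options))
print(tf.train.Example.FromString(record))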

Now define the transformed dataset.


In [47]:
%%ml dataset create
format: transformed
name: taxi_transformed_data
train: ./taxi/transform/train-*
eval: ./taxi/transform/eval-*



In [48]:
%%ml dataset explore
name: taxi_transformed_data


train data instances: 203998
eval data instances: 10832

Step 3: Training

MLWorkbench helps you build standard TensorFlow models without having to write any TensorFlow code.


In [49]:
# Delete previous run results.
!rm -r -f ./taxi/linear_train
!rm -r -f ./taxi/dnn_train


Linear Regression Model

Let's build a linear model first.


In [50]:
%%ml train
output: ./taxi/linear_train
analysis: ./taxi/analysis
data: $taxi_transformed_data
model_args:
    model: linear_regression
    learning-rate: 0.1
    max-steps: 30000


TensorBoard was started successfully with pid 27128. Click here to access it.

You can click the TensorBoard link to monitor training progress.

From TensorBoard, the last eval loss value is 50.9953, so RMSE is around 7.14 (sqrt(50.9953)).
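
Since the eval loss is a mean squared error, the RMSE above is just its square root:

import math

# RMSE is the square root of the reported mean squared error loss.
print(math.sqrt(50.9953))  # ~7.14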

Alternatively, you can plot the training events inside the notebook for sharing or presentation.


In [51]:
from google.datalab.ml import Summary

summary = Summary('./taxi/linear_train')
summary.plot('loss')


DNN Regression Model

RMSE = 7.14 is not very impressive. Let's see if we can do better with a DNN regression model. Note that this time we added a few parameters (hidden-layer-size1, hidden-layer-size2); for DNN models, you need to provide the number and size of the hidden layers. Also, max-steps is omitted, which means training will run until it detects that the eval loss is no longer decreasing, or until it hits the epoch limit (1000).


In [52]:
%%ml train
output: ./taxi/dnn_train
analysis: ./taxi/analysis
data: $taxi_transformed_data
model_args:
    model: dnn_regression
    hidden-layer-size1: 200
    hidden-layer-size2: 100


TensorBoard was started successfully with pid 37026. Click here to access it.


In [53]:
summary = Summary('./taxi/dnn_train')
summary.plot('loss')


Loss = 13.79, so RMSE is about 3.71. The DNN model performs much better than the linear one. This is not surprising, because trip fare is probably not "very" linear in any of the features; instead, we need some non-linear activations in the model, which is exactly what a DNN provides.

Step 4: Evaluation using batch prediction

Below, we use the evaluation model to run batch prediction locally. Batch prediction is needed for large datasets where the data cannot fit in memory. For demo purposes, we will use the evaluation data again.


In [54]:
!rm -r -f ./taxi/batch_predict # Delete previous results.


There are two model dirs under our training dir: "evaluation_model" and "model". The difference between the two is that the evaluation model expects input that includes the target (truth) column, while the regular model expects no target column. The evaluation model passes the input target value through to its output as is; because it outputs both the target and the predicted value, it is well suited for model evaluation.


In [55]:
!ls ./taxi/dnn_train/


evaluation_model  model  schema_without_target.json  train
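
One way to confirm this difference is to inspect each SavedModel's serving signature. A sketch, assuming TensorFlow's saved_model_cli tool is on the path; the evaluation model's inputs should include the target (fare) column, while the regular model's should not:

!saved_model_cli show --dir ./taxi/dnn_train/model --all
!saved_model_cli show --dir ./taxi/dnn_train/evaluation_model --all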

In [56]:
%%ml batch_predict
model: ./taxi/dnn_train/evaluation_model/
output: ./taxi/batch_predict
format: csv
data:
  csv: ./taxi/eval.csv


local prediction...
INFO:tensorflow:Restoring parameters from ./taxi/dnn_train/evaluation_model/variables/variables
done.

In [57]:
!ls ./taxi/batch_predict


predict_results_eval.csv  predict_results_schema.json

Note that the "predict_results_schema.json" file describes the CSV schema of "predict_results_eval.csv".
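
You could also cross-check the metrics below by hand with pandas. A sketch, assuming the schema file uses the same list-of-{name, type} layout as the dataset schema above, and that the truth and prediction columns are named 'target' and 'predicted' (check the schema file for the actual names):

import json
import numpy as np
import pandas as pd

# Read column names from the schema file, then load the prediction results.
with open('./taxi/batch_predict/predict_results_schema.json') as f:
    schema = json.load(f)
names = [col['name'] for col in schema]
results = pd.read_csv('./taxi/batch_predict/predict_results_eval.csv',
                      header=None, names=names)

# 'target' and 'predicted' are assumed column names; adjust per the schema.
rmse = np.sqrt(np.mean((results['predicted'] - results['target']) ** 2))
print(rmse)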


In [58]:
%%ml evaluate regression
csv: ./taxi/batch_predict/predict_results_eval.csv


Out[58]:
metric value
0 Root Mean Square Error 3.565291
1 Mean Absolute Error 1.896622
2 50 Percentile Absolute Error 1.290230
3 90 Percentile Absolute Error 3.567000
4 99 Percentile Absolute Error 12.834100

Prediction

MLWorkbench supports running prediction and displaying the results within the notebook.

Local Prediction

Note that we now use the non-evaluation model (./taxi/dnn_train/model), which takes input without a target column. The prediction data is taken from the eval CSV file with the target (fare) column removed.
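
The instances below were written out by hand, but you can generate them from the eval CSV the same way: drop the fare column (the second column, per the query above) and rejoin the fields. A minimal sketch:

import pandas as pd

# Read a few eval rows as raw strings (keep empty fields empty, not NaN).
rows = pd.read_csv('./taxi/eval.csv', header=None, nrows=4,
                   dtype=str, keep_default_na=False)

# Column 1 is the target (fare); drop it and rebuild the CSV strings.
instances = rows.drop(1, axis=1).apply(','.join, axis=1).tolist()
print(instances)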


In [59]:
%%ml predict
model: ./taxi/dnn_train/model/
data:
  - 144b42f903352f760b969b3a7bca941fa7474b26,4,289,22,42.009018227,-87.672723959,42.009018227,-87.672723959,
  - 2c09f875e5a58220344e717c4276fd322ff3c3e6,1,307,0,41.912364354,-87.675062757,41.963374382,-87.67018455,Taxi Affiliation Services
  - b352a154e8670f35d4050d35be6b8c73222854fc,7,214,14,41.912364354,-87.675062757,41.891971508,-87.612945414,Taxi Affiliation Services
  - 2e84ad9967c1a07de42582679a2891b2ecacd3b0,7,38,1,41.912364354,-87.675062757,41.921877461,-87.66407824,


predicted unique_key company day dropoff_latitude dropoff_longitude hour pickup_latitude pickup_longitude weekday
6.540888 144b42f903352f760b969b3a7bca941fa7474b26 289 42.009018227 -87.672723959 22 42.009018227 -87.672723959 4
11.551096 2c09f875e5a58220344e717c4276fd322ff3c3e6 Taxi Affiliation Services 307 41.963374382 -87.67018455 0 41.912364354 -87.675062757 1
11.965454 b352a154e8670f35d4050d35be6b8c73222854fc Taxi Affiliation Services 214 41.891971508 -87.612945414 14 41.912364354 -87.675062757 7
7.704762 2e84ad9967c1a07de42582679a2891b2ecacd3b0 38 41.921877461 -87.66407824 1 41.912364354 -87.675062757 7

Online Prediction

Datalab includes a prediction client for online (deployed) models.

Deploy Model

We can deploy the locally trained model online so it can serve prediction requests via HTTP. To deploy a local model, we'll need to:

  1. Enable the Machine Learning API in your Google Cloud project (one way is shown after this list).
  2. Have a staging GCS location ready.
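
For step 1, one way to enable the API from the notebook is through the gcloud CLI. A sketch, assuming gcloud is configured for your project and that ml.googleapis.com is the service name of the Machine Learning API:

!gcloud services enable ml.googleapis.com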

In [60]:
# Create a staging GCS bucket

!gsutil mb gs://datalab-taxi-local-model-staging


Creating gs://datalab-taxi-local-model-staging/...

In [61]:
# Copy model files over.

!gsutil -m cp -r ./taxi/dnn_train/model gs://datalab-taxi-local-model-staging/model


Copying file://./taxi/dnn_train/model/assets.extra/features.json [Content-Type=application/json]...
Copying file://./taxi/dnn_train/model/saved_model.pb [Content-Type=application/octet-stream]...
Copying file://./taxi/dnn_train/model/variables/variables.data-00000-of-00001 [Content-Type=application/octet-stream]...
Copying file://./taxi/dnn_train/model/assets.extra/schema.json [Content-Type=application/json]...
Copying file://./taxi/dnn_train/model/variables/variables.index [Content-Type=application/octet-stream]...
Operation completed over 5 objects/550.4 KiB.                                    

In [62]:
%%ml model deploy
name: chicago_taxi.v1
path: gs://datalab-taxi-local-model-staging/model


Waiting for operation "projects/bradley-playground/operations/create_chicago_taxi_v1-1508445089708"
Done.

Build your own prediction client with Python

A common task is to call a deployed model from other applications. Below is an example of writing a Python client to run prediction outside of Datalab.

For more information about model permissions, see https://cloud.google.com/ml-engine/docs/tutorials/python-guide and https://developers.google.com/identity/protocols/application-default-credentials .


In [63]:
import json

# google.datalab is imported only to read the default project ID below.
import google.datalab
from oauth2client.client import GoogleCredentials
from googleapiclient import discovery
from googleapiclient import errors

# Store your project ID, model name, and version name in the format the API needs.
api_path = 'projects/{your_project_ID}/models/{model_name}/versions/{version_name}'.format(
    your_project_ID=google.datalab.Context.default().project_id,
    model_name='chicago_taxi',
    version_name='v1')

# Get application default credentials (possible only if the gcloud tool is
#  configured on your machine). See https://developers.google.com/identity/protocols/application-default-credentials
#  for more info.
credentials = GoogleCredentials.get_application_default()

# Build a representation of the Cloud ML API.
ml = discovery.build('ml', 'v1', credentials=credentials)

# Create a dictionary containing data to predict.
# Note that the data is a list of csv strings.
body = {
    'instances': [
        'cacd255b228cae40828feb8575b7d51d01f7c30e,7,201,21,41.912364354,-87.675062757,41.892042136,-87.63186395,',
        'd41200b7ad9f1ae499a27eacec13ccebd3f227e4,1,327,0,41.912364354,-87.675062757,41.949060526,-87.661642904,Northwest Management LLC',
        'd36e0da792ff7d075a31460945a473fd91f1770b,6,262,19,41.912364354,-87.675062757,41.914747305,-87.654007029,',
    ]
}

# Create a request
request = ml.projects().predict(
    name=api_path,
    body=body)

# Make the call.
try:
    response = request.execute()
    print('\nThe response:\n')
    print(json.dumps(response, indent=2))
except errors.HttpError as err:
    # Something went wrong, print out some information.
    print('There was an error. Check the details:')
    print(err._get_reason())


The response:

{
  "predictions": [
    {
      "predicted": 11.054550170898438, 
      "unique_key": "cacd255b228cae40828feb8575b7d51d01f7c30e"
    }, 
    {
      "predicted": 9.904248237609863, 
      "unique_key": "d41200b7ad9f1ae499a27eacec13ccebd3f227e4"
    }, 
    {
      "predicted": 7.496986389160156, 
      "unique_key": "d36e0da792ff7d075a31460945a473fd91f1770b"
    }
  ]
}

Clean up


In [64]:
%%ml model delete
name: chicago_taxi.v1


Waiting for operation "projects/bradley-playground/operations/delete_chicago_taxi_v1-1508445173136"
Done.

In [65]:
%%ml model delete
name: chicago_taxi


Waiting for operation "projects/bradley-playground/operations/delete_model_chicago_taxi-1508445215"
Done.

In [66]:
# Delete the GCS bucket
!gsutil -m rm -r gs://datalab-taxi-local-model-staging


Removing gs://datalab-taxi-local-model-staging/model/assets.extra/features.json#1508445065872883...
Removing gs://datalab-taxi-local-model-staging/model/assets.extra/schema.json#1508445065806192...
Removing gs://datalab-taxi-local-model-staging/model/saved_model.pb#1508445066055646...
Removing gs://datalab-taxi-local-model-staging/model/variables/variables.index#1508445070066488...
Removing gs://datalab-taxi-local-model-staging/model/variables/variables.data-00000-of-00001#1508445066062965...
/ [5/5 objects] 100% Done                                                       
Operation completed over 5 objects.                                              
Removing gs://datalab-taxi-local-model-staging/...
